QA Models
One More Question is Enough: Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning
Wang, Mengyu, Sabanis, Sotirios, de Carvalho, Miguel, Cohen, Shay B., Ma, Tiejun
Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.
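The reward design is concrete enough to sketch. Below is a minimal, hypothetical rendering of the signal the abstract describes, where a generated sub-question is rewarded by how much it improves the downstream answer; `qa_model` and `score_answer` are illustrative stand-ins, not the paper's actual interface.

```python
# Hedged sketch of the EQD-style reward: a sub-question earns reward equal to
# the improvement it brings to the QA model's answer quality.
# `qa_model` and `score_answer` are hypothetical callables, not the paper's API.

def eqd_reward(question, sub_question, gold_answer, qa_model, score_answer):
    """Reward = answer quality with the sub-question minus quality without it."""
    baseline = score_answer(qa_model(question), gold_answer)
    assisted = score_answer(qa_model(f"{sub_question}\n{question}"), gold_answer)
    return assisted - baseline
```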
Integrating Neural and Symbolic Components in a Model of Pragmatic Question-Answering
Tsvilodub, Polina, Hawkins, Robert D., Franke, Michael
Computational models of pragmatic language use have traditionally relied on hand-specified sets of utterances and meanings, limiting their applicability to real-world language use. We propose a neuro-symbolic framework that enhances probabilistic cognitive models by integrating LLM-based modules to propose and evaluate key components in natural language, eliminating the need for manual specification. Through a classic case study of pragmatic question-answering, we systematically examine various approaches to incorporating neural modules into the cognitive model -- from evaluating utilities and literal semantics to generating alternative utterances and goals. We find that hybrid models can match or exceed the performance of traditional probabilistic models in predicting human answer patterns. However, the success of the neuro-symbolic model depends critically on how LLMs are integrated: while they are particularly effective for proposing alternatives and transforming abstract goals into utilities, they face challenges with truth-conditional semantic evaluation. This work charts a path toward more flexible and scalable models of pragmatic language use while illuminating crucial design considerations for balancing neural and symbolic components.
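To make the division of labor concrete, here is a minimal sketch of the hybrid pattern under stated assumptions: an LLM-backed module proposes alternative answers and scores their utility, while the symbolic layer is a plain soft-max choice rule; `propose_alternatives` and `utility` are hypothetical stand-ins for the neural modules, not the authors' implementation.

```python
import math

# Symbolic decision rule over neurally proposed alternatives (a sketch only).
# `propose_alternatives` and `utility` would be LLM-backed in a hybrid model;
# here they are hypothetical callables.

def pragmatic_answer_distribution(question, propose_alternatives, utility, alpha=1.0):
    answers = propose_alternatives(question)           # neural proposal step
    scores = [alpha * utility(question, a) for a in answers]
    z = sum(math.exp(s) for s in scores)               # soft-max normalization
    return {a: math.exp(s) / z for a, s in zip(answers, scores)}
```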
Improving QA Efficiency with DistilBERT: Fine-Tuning and Inference on Mobile Intel CPUs
Question answering (QA) systems have become a cornerstone of natural language processing (NLP), enabling machines to extract precise answers from textual contexts. The Stanford Question Answering Dataset (SQuAD) v1.1 [Rajpurkar et al., 2016] is a widely adopted benchmark for evaluating QA models, comprising over 87,000 training examples of context-question-answer triples. While transformer-based models like BERT [Devlin et al., 2019] have achieved state-of-the-art performance on SQuAD, their computational complexity often demands GPU acceleration, limiting deployment on resource-constrained devices such as mid-range CPUs. This study addresses the challenge of developing a transformer-based QA model optimized for inference on a 13th Gen Intel i7-1355U CPU, a 10-core processor with a 5.0 GHz turbo frequency. We focus on DistilBERT [Sanh et al., 2020], a lightweight transformer, to balance performance (measured by F1 score and accuracy) with inference speed. Our contributions include:
- Comprehensive exploratory data analysis (EDA) of SQuAD v1.1 to inform modeling decisions.
- Data augmentation strategies to enhance model robustness to low-overlap question-context pairs.
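For readers who want a starting point, a minimal CPU-only inference sketch in the spirit of the study follows, using the public distilbert-base-cased-distilled-squad checkpoint from Hugging Face (not the authors' own fine-tuned model).

```python
from transformers import pipeline

# CPU-only QA inference with a DistilBERT model distilled on SQuAD.
# The checkpoint is a public Hugging Face model, not the study's artifact.
qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    device=-1,  # -1 pins the pipeline to the CPU
)

result = qa(
    question="What dataset is used as the benchmark?",
    context="SQuAD v1.1 comprises over 87,000 training examples of "
            "context-question-answer triples.",
)
print(result["answer"], round(result["score"], 3))
```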
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Ho, Xanh, Huang, Jiahao, Boudin, Florian, Aizawa, Akiko
Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With their recent success, large language models (LLMs) have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human judgments improves significantly, from 0.22 (EM) and 0.40 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.
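The judging protocol is easy to picture; here is a hedged sketch of one way to phrase it, with `llm` as a hypothetical completion function and a prompt template that is illustrative rather than the authors' exact wording.

```python
# Illustrative LLM-as-a-judge call: the judge sees question, gold answer, and
# prediction, and returns a binary verdict. The template and the `llm` callable
# are assumptions, not the paper's protocol.

JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "Predicted answer: {pred}\n"
    "Does the predicted answer convey the same meaning as the gold answer? "
    "Answer with exactly 'yes' or 'no'."
)

def llm_judge(question, gold, pred, llm):
    verdict = llm(JUDGE_TEMPLATE.format(question=question, gold=gold, pred=pred))
    return verdict.strip().lower().startswith("yes")
```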
Uncertainty Quantification in Retrieval Augmented Question Answering
Perez-Beltrachini, Laura, Lapata, Mirella
Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful at answering correctly. In this work, we propose to quantify the uncertainty of a QA model via estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information theoretic metrics can predict answer correctness up to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at https://github.com/lauhaide/ragu.
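The supervision signal the paper builds on can be sketched simply: label each (question, passage) pair by whether the target QA model answers correctly with that passage, then fit a lightweight predictor. The sketch below uses scikit-learn's logistic regression as a stand-in for the paper's neural utility model; `embed`, `qa_model`, and `is_correct` are hypothetical.

```python
from sklearn.linear_model import LogisticRegression

# Sketch of passage-utility training data: the label is whether the target QA
# model answered correctly given the passage. `embed`, `qa_model`, and
# `is_correct` are hypothetical stand-ins for the paper's components.

def build_utility_dataset(examples, qa_model, is_correct, embed):
    X, y = [], []
    for question, passage, gold in examples:
        pred = qa_model(question, passage)
        X.append(embed(question, passage))      # feature vector for the pair
        y.append(int(is_correct(pred, gold)))   # 1 if the passage was useful
    return X, y

def train_utility_model(X, y):
    # A linear probe stands in for the paper's lightweight neural predictor.
    return LogisticRegression(max_iter=1000).fit(X, y)
```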
Quantifying Memorization and Retriever Performance in Retrieval-Augmented Vision-Language Models
Carragher, Peter, Jha, Abhinand, Raghav, R, Carley, Kathleen M.
Large Language Models (LLMs) demonstrate remarkable capabilities in question answering (QA), but metrics for assessing their reliance on memorization versus retrieval remain underdeveloped. Moreover, while finetuned models are state-of-the-art on closed-domain tasks, general-purpose models like GPT-4o exhibit strong zero-shot performance. This raises questions about the trade-offs between memorization, generalization, and retrieval. In this work, we analyze the extent to which multimodal retrieval-augmented vision-language models (VLMs) memorize training data compared to baseline VLMs. Using the WebQA benchmark, we contrast finetuned models with baseline VLMs on multihop retrieval and question answering, examining the impact of finetuning on data memorization. To quantify memorization in end-to-end retrieval and QA systems, we propose several proxy metrics, derived by investigating instances where QA succeeds despite retrieval failing. Our results reveal the extent to which finetuned models rely on memorization. In contrast, retrieval-augmented VLMs have lower memorization scores, at the cost of accuracy (72% vs 52% on the WebQA test set). As such, our measures pose a challenge for future work to reconcile memorization and generalization in both Open-Domain QA and joint Retrieval-QA tasks.
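One of the proxy metrics the abstract hints at can be written down directly: the share of examples where QA succeeds even though retrieval failed, which suggests the answer came from memorized training data rather than retrieved evidence. The record schema below is illustrative, not the paper's.

```python
# Hedged sketch of a memorization proxy: among retrieval failures, how often
# does the QA system still answer correctly? Field names are assumptions.

def memorization_score(records):
    failures = [r for r in records if not r["retrieval_hit"]]
    if not failures:
        return 0.0
    memorized = sum(1 for r in failures if r["qa_correct"])
    return memorized / len(failures)

records = [
    {"retrieval_hit": False, "qa_correct": True},   # correct without evidence
    {"retrieval_hit": False, "qa_correct": False},
    {"retrieval_hit": True,  "qa_correct": True},
]
print(memorization_score(records))  # 0.5
```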
AmaSQuAD: A Benchmark for Amharic Extractive Question Answering
Hailemariam, Nebiyou Daniel, Guda, Blessed, Tefferi, Tsegazeab
This research presents a novel framework for translating extractive question-answering datasets into low-resource languages, demonstrated by the creation of the AmaSQuAD dataset, a translation of SQuAD 2.0 into Amharic. The methodology addresses challenges related to misalignment between translated questions and answers, as well as the presence of multiple answer instances in the translated context. For this purpose, we use cosine similarity over embeddings from a fine-tuned BERT-based model for Amharic, together with the Longest Common Subsequence (LCS). Additionally, we fine-tune the XLM-R model on the synthetic AmaSQuAD dataset for Amharic question answering. The results show an improvement over baseline performance, with the fine-tuned model increasing the F1 score from 36.55% to 44.41% and from 50.01% to 57.5% on the AmaSQuAD development dataset. Moreover, the model improves on the human-curated AmQA dataset, increasing the F1 score from 67.80% to 68.80% and the exact match score from 52.50% to 52.66%. The AmaSQuAD dataset is publicly available.
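The alignment step combines two signals the abstract names, embedding cosine similarity and LCS overlap; a hedged sketch follows, with `embed` as a hypothetical sentence-embedding function and SequenceMatcher standing in for a normalized LCS score.

```python
import numpy as np
from difflib import SequenceMatcher

# Sketch of answer-span selection: score each candidate span in the translated
# context by a mix of embedding cosine similarity and LCS-style string overlap
# against the translated answer. `embed` is a hypothetical embedding function.

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lcs_ratio(a, b):
    # SequenceMatcher's ratio is a practical proxy for normalized LCS length.
    return SequenceMatcher(None, a, b).ratio()

def pick_answer_span(candidates, translated_answer, embed, w=0.5):
    target = embed(translated_answer)
    return max(
        candidates,
        key=lambda span: w * cosine(embed(span), target)
                         + (1 - w) * lcs_ratio(span, translated_answer),
    )
```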
Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations
Nuutinen, Emil, Rastas, Iiro, Ginter, Filip
We apply a simple method to machine translate datasets with span-level annotation, using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset, and more generally of the MT method, through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, and the performance of downstream-trained QA models. In all these evaluations, we find that the transfer method is not only simple to use but also produces consistently better translated data. Given its good performance on the SQuAD dataset, the method can likely be used to translate other span-annotated datasets for other tasks and languages as well. All code and data are available under an open license: data at HuggingFace TurkuNLP/squad_v2_fi, code on GitHub TurkuNLP/squad2-fi, and model at HuggingFace TurkuNLP/bert-base-finnish-cased-squad2.
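The core trick, carrying span annotations through a formatting-preserving translator, can be sketched without any DeepL-specific code: wrap the answer span in an inline tag before translation and recover it afterwards. The translation call itself is left abstract, and the tag handling below assumes a single annotated span per context.

```python
import re

# Sketch of span transfer via formatting-preserving MT: tag the span, translate
# the tagged document, then locate the tag in the output. Assumes one span per
# context; the MT call itself is out of scope here.

def mark_span(context, start, end, tag="u"):
    return f"{context[:start]}<{tag}>{context[start:end]}</{tag}>{context[end:]}"

def recover_span(translated, tag="u"):
    match = re.search(f"<{tag}>(.*?)</{tag}>", translated, flags=re.S)
    if match is None:
        return None, None            # the span was lost in translation
    answer = match.group(1)
    start = match.start(1) - len(f"<{tag}>")   # offset once the tag is stripped
    return answer, start
```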
Deceiving Question-Answering Models: A Hybrid Word-Level Adversarial Approach
Li, Jiyao, Ni, Mingze, Gong, Yongshun, Liu, Wei
Deep learning underpins most currently advanced natural language processing (NLP) tasks, such as text classification, neural machine translation (NMT), abstractive summarization, and question answering (QA). However, the robustness of these models, particularly QA models, against adversarial attacks is a critical concern that remains insufficiently explored. This paper introduces QA-Attack (Question Answering Attack), a novel word-level adversarial strategy that fools QA models. Our attention-based attack exploits a customized attention mechanism and a deletion-ranking strategy to identify and target specific words within contextual passages. It creates deceptive inputs by carefully choosing and substituting synonyms, preserving grammatical integrity while misleading the model into producing incorrect responses. Our approach demonstrates versatility across various question types, particularly when dealing with long textual inputs. Extensive experiments on multiple benchmark datasets demonstrate that QA-Attack successfully deceives baseline QA models and surpasses existing adversarial techniques in success rate, semantic change, BLEU score, fluency, and grammatical error rate.
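The attack loop is simple at heart: rank context words by importance, then greedily swap synonyms for the top-ranked ones. The sketch below leaves `importance` (for example, an attention weight or a deletion-based score) and `synonyms` as hypothetical callables; it is a pattern illustration, not the paper's QA-Attack implementation.

```python
# Hedged sketch of a word-level substitution attack: perturb only the most
# important words, replacing each with a synonym. `importance` and `synonyms`
# are hypothetical stand-ins for the paper's attention/deletion ranking and
# synonym generator.

def word_level_attack(context_words, importance, synonyms, budget=3):
    ranked = sorted(range(len(context_words)),
                    key=lambda i: importance(i), reverse=True)
    perturbed = list(context_words)
    for i in ranked[:budget]:                 # touch only the top-ranked words
        candidates = synonyms(perturbed[i])
        if candidates:
            perturbed[i] = candidates[0]      # greedy substitution
    return " ".join(perturbed)
```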
Learning-to-Defer for Extractive Question Answering
Montreuil, Yannis, Carlier, Axel, Ng, Lai Xing, Ooi, Wei Tsang
Pre-trained language models have profoundly impacted the field of extractive question answering, leveraging large-scale textual corpora to enhance contextual language understanding. Despite their success, these models struggle in complex scenarios that demand nuanced interpretation or inferential reasoning beyond immediate textual cues. Furthermore, their size poses deployment challenges on resource-constrained devices. Addressing these limitations, we introduce an adapted two-stage Learning-to-Defer mechanism for question answering that enhances decision-making by enabling selective deference to human experts or larger models, without retraining the language model. This approach not only maintains computational efficiency but also significantly improves model reliability and accuracy in ambiguous contexts. We establish the theoretical soundness of our methodology by proving Bayes and $(\mathcal{H}, \mathcal{R})$-consistency of our surrogate loss function, guaranteeing the optimality of the final solution. Empirical evaluations on the SQuADv2 dataset illustrate performance gains from integrating human expertise and leveraging larger models. Our results further demonstrate that deferring a minimal number of queries allows the smaller model to achieve performance comparable to its larger counterparts while preserving computational efficiency, thus broadening the applicability of pre-trained language models in diverse operational environments.
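At inference time, the two-stage idea reduces to a rejector deciding when the small model should answer on its own. The sketch below uses a plain confidence threshold as a stand-in for the learned rejector the paper trains via its surrogate loss; all callables are hypothetical.

```python
# Minimal deferral rule: answer with the small model when it is confident,
# otherwise defer to an expert or a larger model. A fixed threshold stands in
# for the paper's learned (surrogate-loss-trained) rejector.

def answer_with_deferral(question, context, small_qa, expert_qa, threshold=0.5):
    answer, confidence = small_qa(question, context)
    if confidence >= threshold:
        return answer, "small_model"
    return expert_qa(question, context), "deferred"
```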